Abstract

It has previously been hypothesized, and supported with some experimental 
evidence, that deeper representations, when well trained, tend to do a better job at 
disentangling the underlying factors of variation. We study the following related 
conjecture: better representations, in the sense of better disentangling, can be ex- 
ploited to produce faster-mixing Markov chains. Consequently, mixing would be 
more efficient at higher levels of representation. To better understand why and how 
this is happening, we propose a secondary conjecture: the higher-level samples fill 
more uniformly the space they occupy and the high-density manifolds tend to un- 
fold when represented at higher levels. The paper discusses these hypotheses and 
tests them experimentally through visualization and measurements of mixing and 
interpolating between samples. 



1 Introduction and Background 

Deep learning algorithms attempt to discover multiple levels of representation of the 
given data (see (Bengio, 2009) for a review), with higher levels of representation de- 
fined hierarchically in terms of lower level ones. The central motivation is that higher- 
level representations can potentially capture higher-level abstractions relevant to the 
distribution of interest. Mathematical results in the case of specific function families 
have shown that choosing a sufficient depth of representation can yield exponential
benefits, in terms of size of the model, to represent some functions (Hastad, 1986;
Hastad and Goldmann, 1991; Bengio et al., 2006; Bengio and LeCun, 2007; Bengio
and Delalleau, 201). The intuition behind these theoretical advantages is that
lower-level features or latent variables can be re-used in many ways to construct higher-level
ones, and the potential gain becomes exponential with respect to depth of the circuit 
that relates lower-level features and higher-level ones (thanks to the exponential num- 
ber of paths in between). The ability of deep learning algorithms to construct abstract 
features or latent variables on top of the observed random variables therefore relies on 
this idea of re-use, which brings with it not only computational but also statistical
advantages in terms of statistical power, e.g., as already exploited in prior machine
learning work such as multi-task learning algorithms (Caruana, 1995; Baxter, 1997;
Collobert and Weston, 2008) and learning algorithms involving parameter sharing
(Lang and Hinton, 1988; LeCun, 1989).

There is another, less commonly discussed, motivation for deep representations,
introduced in (Bengio, 2009): the idea that they may, to some extent, help to disentangle
the underlying factors of variation. Clearly, if we had learning algorithms that
could do a good job of discovering and separating out the underlying causes and fac- 
tors of variation present in the data, it would make further processing (typically, taking 
decisions) much easier. One could even say that the ultimate goal of AI research is 
to build machines that can understand the world around us, i.e., disentangle the fac- 
tors and causes it involves, so progress in that direction seems important. If learned 
representations do a good job of disentangling the underlying factors of variation, it 
is clear that learning (on top of these representations, e.g., towards specific tasks of 
interest) becomes substantially easier because disentangling counters the effects of the 
curse of dimensionality. In fact, in the extreme case, and where we also observe the 
target variables and effects of decisions, there is no need for further learning, only good 
inference. Several observations have been made and reported that suggest that some 
deep learning algorithms indeed help to disentangle the underlying factors of variation
(Goodfellow et al., 2009; Glorot et al., 2011). However, it is not clear why, and to
what extent in general (if any), different deep learning algorithms may sometimes help 
this disentangling. 

Most deep learning algorithms are based on some form of unsupervised learning, 
hence capturing salient structure in the data distribution. Whereas deep learning algo- 
rithms have mostly been used to learn features and exploit them for classification or 
regression tasks, their unsupervised nature also means that in several cases they can 
be used to generate samples. In general the associated sampling algorithms involve a 
Markov Chain and MCMC techniques, and these can notoriously suffer from a funda- 
mental problem of mixing: it is difficult for the Markov chain to jump from one mode 
of the distribution to another, when these are separated by large low-density regions, a
common situation in real-world data, and under the manifold hypothesis (Cayton, 2005;
Narayanan and Mitter, 2010). This hypothesis states that natural classes present in the
data are associated with low-dimensional regions in input space (manifolds) near which
the distribution concentrates, and that different class manifolds are well-separated by
regions of very low density. Slow mixing means that consecutive samples tend to be 
correlated (belong to the same mode) and that it takes many consecutive sampling steps 
to go from one mode to another and even more to cover all of them, i.e., to obtain a 
large enough representative set of samples (e.g. to compute an expected value under the 
target distribution). This happens because these jumps through the empty low-density 
void between modes are unlikely and rare events. When a learner has a poor model 
of the data, e.g., in the initial stages of learning, the model tends to correspond to a 
smoother and higher-entropy (closer to uniform) distribution, putting mass in larger 
volumes of input space, and in particular, between the modes (or manifolds). This can 
be visualized in generated samples of images, that look more blurred and noisy. Mix- 
ing is therefore initially easy for such poor models. However, as the model improves 
and its corresponding distribution sharpens near where the data concentrate, mixing 
becomes considerably slower. Since sampling is an integral part of many learning
algorithms (e.g., to estimate the log-likelihood gradient), slower mixing then means
slower or poorer learning, and one may even suspect that learning stalls at some point
because of the limitations of the sampling algorithm. To improve mixing, a powerful
idea that has been explored recently for deep learning algorithms (Desjardins et al.,
2010; Cho et al., 2010; Salakhutdinov, 2010b,a) is tempering (Neal, 1994). The idea
is to use smoother densities (associated with higher temperature in a Boltzmann Machine
or Markov Random Field formulation) to make quick but approximate jumps between
modes, but use the sharp "correct" model to actually generate the samples of interest
inside these modes, and allow samples to be exchanged between the different levels of
temperature.
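
To make the tempering idea concrete, here is a minimal sketch of parallel tempering on a toy one-dimensional bimodal target. Everything in it is an illustrative assumption rather than the procedure of the cited works: the target `log_p`, the temperature ladder, the Metropolis proposal width and the function name `parallel_tempering` are hypothetical choices.

```python
import math
import random


def log_p(x):
    """Unnormalized log-density of a toy target with sharp modes at -4 and +4."""
    a = -0.5 * ((x + 4.0) / 0.5) ** 2
    b = -0.5 * ((x - 4.0) / 0.5) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def parallel_tempering(n_steps, temperatures, step=1.0, seed=0):
    """Return the states visited by the chain at temperatures[0]."""
    rng = random.Random(seed)
    states = [-4.0] * len(temperatures)  # every chain starts in the left mode
    cold_samples = []
    for _ in range(n_steps):
        # Metropolis update of each chain, targeting p(x) ** (1 / T).
        for i, temp in enumerate(temperatures):
            proposal = states[i] + rng.gauss(0.0, step)
            log_accept = (log_p(proposal) - log_p(states[i])) / temp
            if math.log(1.0 - rng.random()) < log_accept:
                states[i] = proposal
        # Propose exchanging the states of two neighboring temperatures.
        if len(temperatures) > 1:
            i = rng.randrange(len(temperatures) - 1)
            log_swap = (1.0 / temperatures[i] - 1.0 / temperatures[i + 1]) * (
                log_p(states[i + 1]) - log_p(states[i])
            )
            if math.log(1.0 - rng.random()) < log_swap:
                states[i], states[i + 1] = states[i + 1], states[i]
        cold_samples.append(states[0])
    return cold_samples


plain = parallel_tempering(5000, [1.0])
tempered = parallel_tempering(5000, [1.0, 4.0, 16.0, 64.0])
print("plain chain reaches right mode:   ", max(plain) > 0)
print("tempered chain reaches right mode:", max(tempered) > 0)
```

In this toy setting the single chain at temperature 1 has to cross a region of negligible density to change mode, whereas the tempered ensemble lets the smooth high-temperature chains cross and hands their states down through exchanges.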

Here we want to discuss another possibly related idea, and claim that mixing is eas- 
ier when sampling at the higher levels of representation. The objective is not to propose 
a new sampling algorithm or a new learning algorithm, but rather to investigate this hy- 
pothesis through experiments using existing deep learning algorithms. The objective 
is to further our understanding of this hypothesis through more specific hypotheses 
aiming to explain why this would happen, using further experimental validation to test 
these more specific hypotheses. The idea that deeper generative models produce not 
only better features for classification but also better quality samples (in the sense of 
better corresponding to the target distribution being learned) is not novel and several
observations support this hypothesis already, some quantitatively (Salakhutdinov and
Hinton, 2009), some more qualitatively (Hinton et al., 2006). The specific contribution
of this paper is to focus on why the samples may be better, and in particular, why the
chains may converge faster, based on the previously introduced idea that deeper
representations can do a better job of disentangling the underlying factors of variation.



2 Hypotheses 

We first clarify the first hypothesis to be tested here. 

Hypothesis H1: Depth vs Better Mixing. A successfully trained
deeper architecture has the potential to yield representation spaces in
which Markov chains mix faster.

If experiments validate that hypothesis, the most important next question is: why? 
The main explanation we conjecture is formalized in the following hypothesis. 

Hypothesis H2: Depth vs Disentangling. Part of the explanation of
H1 is that deeper representations can better disentangle the
underlying factors of variation.

Why would that help to explain H1? Imagine an abstract (high-level) representation
for object image data in which one of the factors is the "reverse video bit", which
inverts black and white, e.g., flipping that bit replaces intensity $x \in [0, 1]$ by $1 - x$.
With the default value of 0, the foreground object is dark and the background light. 
Clearly, flipping that bit does not change most of the other semantic characteristics 
of the image, which could be represented in other high-level features. However, for
every image-level mode, there would be a reverse-video counterpart mode in which
that bit is flipped: these two modes would be separated by vast "empty" (low density) 
regions in input space, making it very unlikely for any Markov chain in input space 
(e.g. Gibbs sampling in an RBM) to jump from one of these two modes to the other,
because that would require most of the input pixels or hidden units of the RBM to 
simultaneously flip their value. Instead, if we consider the high-level representation 
which has a "reverse video" bit, flipping only that bit would be a very likely event under 
most Markov chain transition probabilities, since that flip would be a small change 
preserving high probability. As another example, imagine that some of the bits of the 
high-level representation identify the category of the object in the image, independently 
of pose, illumination, background, etc. Then simply flipping one of these object-class 
bits would also drastically change the raw pixel-space image, while keeping likelihood 
high. Jumping from an object-class mode to another would therefore be easy with a 
Markov chain in representation-space, whereas it would be much less likely in raw
pixel-space. 
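
The reverse-video argument can be checked with a few lines of arithmetic. The sketch below assumes a hypothetical 784-pixel binary image and a hypothetical chain that flips each unit independently with probability `q` in one step; neither comes from the experiments of this paper.

```python
import math


def render(content_bits, reverse_video):
    """Decode hypothetical high-level factors into pixels.

    The reverse-video bit inverts every pixel of the rendered image.
    """
    return [1 - p if reverse_video else p for p in content_bits]


content = [0, 1, 1, 0, 1, 0, 0, 1] * 98  # 784 binary "pixels"
normal = render(content, 0)
inverted = render(content, 1)

# Number of units that must change to move between the two modes.
pixels_changed = sum(a != b for a, b in zip(normal, inverted))
factors_changed = 1  # only the reverse-video bit differs

q = 0.1  # assumed per-unit flip probability in one step of the chain
log10_prob_pixel_space = pixels_changed * math.log10(q)    # every pixel flips at once
log10_prob_factor_space = factors_changed * math.log10(q)  # a single bit flips

print(pixels_changed, factors_changed)
print(log10_prob_pixel_space, log10_prob_factor_space)
```

Under these assumptions the jump is exponentially less likely in pixel space (the exponent scales with the number of pixels) than in the factor space, where it costs a single flip.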

Another point worth discussing (and which should be considered in future work) 
in H2 is the notion of degree of disentangling. Although it is somewhat clear what 
a completely disentangled representation would look like, deep learning algorithms 
are unlikely to do a perfect job of disentangling, and current algorithms do it in stages, 
with more abstract features being extracted at higher levels. Better disentangling would 
mean that some of the learned features have a higher mutual information with some of 
the known factors. One would expect at the same time that the features that are highly 
predictive of one factor be less so of other factors, i.e., that they specialize to one or 
a few of the factors, becoming invariant to others. Please note here the difference 
between the objective of learning disentangled representations and the objective of 
learning invariant features (i.e., invariant to some specific factors of variation which 
are considered to be like nuisance parameters). In the latter case, one has to know 
ahead of time what the nuisance factors are (what is noise and what is signal?). In the 
former, it is not needed: we only seek to separate out the factors from each other. Some 
features should be sensitive to one factor and invariant to the others. 
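
One way to quantify the degree of disentangling described above is to estimate the mutual information between a discretized learned feature and each known factor. The sketch below is only an illustration under assumptions: the helper `mutual_information`, the plug-in estimator and the toy binary factors are not taken from the paper.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of the mutual information (in bits) of two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x, count_y = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) p(y)) ), written with raw counts.
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


# Two hypothetical known factors, balanced and independent of each other.
identity = [0, 1, 0, 1, 0, 1, 0, 1]
pose = [0, 0, 1, 1, 0, 0, 1, 1]

# A hypothetical binarized feature that has specialized to the identity factor.
feature = list(identity)

print("MI(feature, identity) =", mutual_information(feature, identity))
print("MI(feature, pose)     =", mutual_information(feature, pose))
```

A feature that is highly informative about one factor and uninformative about the others, as here, is what a well-disentangled representation would be expected to contain.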

Let us now consider additional hypotheses that specialize H2. 

Hypothesis H3: Disentangling Unfolds and Expands. Part of the 
explanation of H2 is that more disentangled representations tend to 

(a) unfold the manifolds near which raw data concentrates, as well as 

(b) expand the relative volume occupied by high-probability points 
near these manifolds. 

H3(a) says that disentangling has the effect that the projections of high-density
manifolds in the high-level representation space are smoother and easier to model than the
corresponding high-density manifolds in raw input space. Let us again use an object
recognition analogy. If we have perfectly disentangled object identity, pose and
illumination, the high-density manifold associated with the distribution of features in
high-level representation-space is flat: we can make large moves in that space (e.g.,
completely change the lighting) and yet stay in a high-probability region. In fact we 
can make any move inside some bounds and convex constraints and stay in a
high-probability region, so that the distribution of high-level features may look locally more
uniform, which is a consequence of H3(b). A good high-level representation does not
need to allocate as much real estate (sets of values) for unlikely configurations. This is 
already what most unsupervised learning algorithms tend to do. For example,
dimensionality reduction methods such as PCA tend to define representations where most
configurations are likely (but these only occupy a subspace of the possible raw-space 
configurations). Also, in clustering algorithms such as k-means, the training criterion 
is best minimized when clusters are approximately equally weighted, i.e., the average
posterior distribution over cluster identity is approximately uniform. Something similar
is observed in the brain, where different areas of somatosensory cortex correspond
to different body parts, and the size of these areas adaptively depends (Flor, 2003) on
usage of these (i.e., more frequent events are represented more finely and less frequent 
ones are represented more coarsely). Again, keep in mind that the actual representa- 
tions learned by deep learning algorithms are not perfect, but what we will be looking 
for here is whether deeper representations correspond to more unfolded manifolds and 
to more locally uniform distributions, with high-probability points occupying an over- 
all greater volume (compared to the available volume). 



3 Representation-Learning Algorithms 

The learning algorithms used in this paper to explore the preceding hypotheses are 
the Deep Belief Network or DBN (Hinton et al., 2006), trained by stacking Restricted
Boltzmann Machines or RBMs, and the Contractive Auto-Encoder or CAE (Rifai et al.,
2011a), for which a sampling algorithm was recently proposed (Rifai et al., 2012).

See (Bengio, 2009) for a detailed review of RBMs and DBNs. Each layer of the
DBN is trained as an RBM, and a 1-layer DBN is just an RBM. An RBM defines
a joint distribution between a hidden layer h and a visible layer v. Gibbs sampling 
at the top level of the DBN is used to obtain samples from the model: the sampled 
top-level representations are stochastically projected down to lower levels through the 
conditional distributions $P(v \mid h)$ defined in each RBM. To avoid unnecessary additional
noise, and as previous authors have done, at the last stage of this process (i.e. to obtain
the raw-input level samples), only the mean-field values of the visible units are used,
i.e., $E[v \mid h]$. In the experiments on face data (where grey levels matter a lot), a Gaussian
RBM is used at the lowest level. 
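
The sampling procedure can be sketched for the simplest case, a 1-layer DBN, i.e. a single binary RBM. This is a minimal illustration, not the code used for the experiments: the layer sizes, the random untrained weights `W` and the helper names are assumptions, and a deeper DBN would additionally project the top-level samples down through the lower layers.

```python
import math
import random


def sigmoid(a):
    """Numerically stable logistic function."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def hidden_probs(v, W, c):
    """P(h_j = 1 | v); W[i][j] connects visible unit i to hidden unit j."""
    return [sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(c))]


def visible_probs(h, W, b):
    """P(v_i = 1 | h), which for binary units equals the mean-field value E[v_i | h]."""
    return [sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(b))]


def sample_bits(probs, rng):
    """Draw independent Bernoulli variables with the given probabilities."""
    return [1 if rng.random() < p else 0 for p in probs]


def gibbs_chain(v0, W, b, c, n_steps, rng):
    """Run block Gibbs sampling and return the mean-field visible vector of each step."""
    v, outputs = list(v0), []
    for _ in range(n_steps):
        h = sample_bits(hidden_probs(v, W, c), rng)
        mean_v = visible_probs(h, W, b)
        outputs.append(mean_v)        # reported sample: E[v | h], no extra pixel noise
        v = sample_bits(mean_v, rng)  # the chain itself continues from a sampled v
    return outputs


rng = random.Random(0)
n_visible, n_hidden = 6, 4
W = [[rng.gauss(0.0, 0.1) for _ in range(n_hidden)] for _ in range(n_visible)]
b, c = [0.0] * n_visible, [0.0] * n_hidden
samples = gibbs_chain([0] * n_visible, W, b, c, n_steps=5, rng=rng)
print(len(samples), len(samples[0]))
```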

An auto-encoder (LeCun, 1987; Hinton and Zemel, 1994) is parametrized through
an encoder function $f$ mapping an input-space vector $x$ to a representation-space vector
$h$, and a decoder function $g$ mapping a representation-space vector $h$ to an input-space
reconstruction $r$. The experiments with the CAE are with $h = f(x) = \mathrm{sigmoid}(Wx + b)$
and $r = g(h) = \mathrm{sigmoid}(W^T h + c)$. The CAE is a regularized auto-encoder, with
tied weights (input to hidden weights are the transpose of hidden to reconstruction
weights). The CAE is trained to minimize a reconstruction loss (cross-entropy here)
plus a contractive regularization penalty $\alpha \left\| \frac{\partial f(x)}{\partial x} \right\|_F^2$
(which is $\alpha$ times the sum of the squared elements of the Jacobian matrix of $f$).
Like RBMs, CAE layers can be stacked to form deeper
models, and one can either view them as deep auto-encoders (by composing the encoders
together and the decoders together) or like in a DBN, as a top-level model (from which 
one can sample) coupled with encoding and decoding functions into the top level (by
composing the lower-level encoders and decoders). A sampling algorithm was recently
proposed for CAEs (Rifai et al., 2012). The idea is to alternate between projecting
through the auto-encoder (i.e. performing a reconstruction) and adding Gaussian noise
$J J^T \epsilon$ in the directions of variation captured by the auto-encoder (in the Jacobian
matrix $J = \frac{\partial f}{\partial x}$ of the encoder function).
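
The following sketch puts the CAE quantities defined above into code for a single layer with tied weights. It is one plausible reading of the description, not the published algorithm: it assumes that the perturbation $J J^T \epsilon$ is added to the code $h$ before decoding, and the sizes, the untrained random weights, the noise level `sigma` and all function names are hypothetical.

```python
import math
import random


def sigmoid(a):
    """Numerically stable logistic function."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def encode(x, W, b):
    """h = sigmoid(W x + b), with W of shape (n_hidden, n_input)."""
    return [sigmoid(b[j] + sum(w * xi for w, xi in zip(W[j], x)))
            for j in range(len(W))]


def decode(h, W, c):
    """r = sigmoid(W^T h + c), reusing the encoder weights (tied weights)."""
    return [sigmoid(c[i] + sum(W[j][i] * h[j] for j in range(len(W))))
            for i in range(len(c))]


def encoder_jacobian(h, W):
    """J[j][i] = d h_j / d x_i = h_j (1 - h_j) W[j][i] for a sigmoid encoder."""
    return [[hj * (1.0 - hj) * w for w in row] for hj, row in zip(h, W)]


def contractive_penalty(h, W, alpha):
    """alpha times the squared Frobenius norm of the encoder Jacobian."""
    return alpha * sum(v * v for row in encoder_jacobian(h, W) for v in row)


def sampling_step(x, W, b, c, sigma, rng):
    """Perturb the code by J J^T eps (assumed placement), then decode to input space."""
    h = encode(x, W, b)
    J = encoder_jacobian(h, W)
    eps = [rng.gauss(0.0, sigma) for _ in h]
    jt_eps = [sum(J[j][i] * eps[j] for j in range(len(h))) for i in range(len(x))]
    delta_h = [sum(row[i] * jt_eps[i] for i in range(len(x))) for row in J]
    return decode([hj + d for hj, d in zip(h, delta_h)], W, c)


rng = random.Random(0)
n_input, n_hidden = 5, 3
W = [[rng.gauss(0.0, 0.5) for _ in range(n_input)] for _ in range(n_hidden)]
b, c = [0.0] * n_hidden, [0.0] * n_input
x = [0.0, 1.0, 1.0, 0.0, 1.0]
chain = [x]
for _ in range(4):
    chain.append(sampling_step(chain[-1], W, b, c, sigma=1.0, rng=rng))
print(contractive_penalty(encode(x, W, b), W, alpha=0.1))
```

With `sigma` set to zero the step reduces to a plain reconstruction, which is a convenient sanity check on the implementation.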



4 Experiments 

The experiments have been performed on the MNIST digits dataset (LeCun et al.,
1998) and the Toronto Face Database (Susskind et al., 2010), TFD. The former has
been heavily used to evaluate many deep learning algorithms, while the latter is
interesting because of the manifold structure it displays, and for which control factors (such
as emotion and person identity) are known. 

The DBNs tested on MNIST have 784-1024-1024 layer sizes (28x28 input), and
2304-512-1024 on TFD (48x48 input). The CAEs have sizes 784-1000-1000 and
2304-1000-1000 on MNIST and TFD respectively.



4.1 Sampling at Different Depths 
4.1.1 Better Samples at Higher Levels 

To test H1, we first plot sequences of samples at various depths. One can verify in
Fig. 1 that samples obtained at deeper layers are visually more likely and mix faster.

In addition, we measure the quality of the obtained samples, using a procedure for
the comparison of sample generators described in Breuleux et al. (2011). It measures
the log-likelihood of a test set under the density computed from a Parzen window
density estimator built on 10,000 generated samples. Log-likelihoods for different models
are presented in Table 1 (rightmost columns). Those results also suggest that the
quality of the samples is higher if the Markov chain process used for sampling takes place
in the upper layers.
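
The Parzen-window measurement can be sketched as follows. This is a minimal illustration assuming an isotropic Gaussian kernel; the bandwidth `sigma` (whose selection is not described here), the function name `parzen_log_likelihood` and the toy points are assumptions.

```python
import math


def parzen_log_likelihood(test_points, samples, sigma):
    """Mean log-density of test_points under a Gaussian Parzen estimator built on samples."""
    d = len(samples[0])
    # Log of the Gaussian normalizer and of the 1/N mixture weight.
    log_norm = -0.5 * d * math.log(2.0 * math.pi * sigma ** 2) - math.log(len(samples))
    total = 0.0
    for x in test_points:
        exponents = [
            -sum((a - b) ** 2 for a, b in zip(x, s)) / (2.0 * sigma ** 2)
            for s in samples
        ]
        m = max(exponents)  # log-sum-exp for numerical stability
        total += log_norm + m + math.log(sum(math.exp(e - m) for e in exponents))
    return total / len(test_points)


generated = [[0.0, 0.0], [1.0, 1.0]]  # stand-in for samples produced by a model
near = parzen_log_likelihood([[0.1, 0.0]], generated, sigma=0.5)
far = parzen_log_likelihood([[5.0, 5.0]], generated, sigma=0.5)
print(near, far)
```

Test points lying close to the generated samples receive a higher log-likelihood, which is why the measure rewards generators whose samples cover the regions where test examples are found.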

This observation agrees with H3(b) that moving in higher-level representation spaces 
where the manifold has been expanded provides higher quality samples than moving 
in the raw input space where it may be hard to stay in high density regions. 



4.1.2 Visualizing Representation-Space by Interpolating Between Neighbors 

According to H3(a), deeper layers tend to locally unfold the manifold near
high-density regions of the input space, while according to H3(b) there should be more
relative volume taken by plausible configurations in representation-space. Both of these
would imply that convex combinations of neighboring examples in representation-space
correspond to more likely input configurations. Indeed, interpolating between
points on a flat manifold should stay on the manifold. Furthermore, when interpolating 
between examples of different classes (i.e., different modes), H3(b) would suggest that 
most of the points in between (on the linear interpolation line) should correspond to 
plausible samples, which would not be the case in input space. In Fig. 2 we interpolate
linearly between neighbors in representation space and visualize in the input space the
interpolated points obtained at various depths. One can see that interpolating at deeper
levels gives visually more plausible samples.

Figure 1: Sequences of 25 samples generated with a CAE on TFD (rows 1 and 2) and
with an RBM on MNIST (rows 3 and 4). On TFD, the second layer clearly allows the chain
to move quickly from woman samples (left) to man samples (right), passing through various
facial expressions, whereas the first layer yields poor samples. Bottom rows: on MNIST,
the first layer gets stuck on the same sample while the second layer mixes among classes.
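
The interpolation experiment of Section 4.1.2 can be illustrated on a toy manifold. In the sketch below the data lie on the unit circle and a hand-written, hypothetical encoder/decoder pair (angle and its inverse) stands in for a learned deep representation that has unfolded the manifold; none of this is the model used in the paper.

```python
import math


def lerp(a, b, t):
    """Point at fraction t on the segment from a to b."""
    return [(1.0 - t) * ai + t * bi for ai, bi in zip(a, b)]


def interpolate_in_representation(x_a, x_b, encode, decode, n_points=5):
    """Encode both inputs, interpolate linearly between the codes, decode each point."""
    h_a, h_b = encode(x_a), encode(x_b)
    return [decode(lerp(h_a, h_b, k / (n_points - 1))) for k in range(n_points)]


def encode(x):
    """Hypothetical 'unfolding' encoder: a point on the circle maps to its angle."""
    return [math.atan2(x[1], x[0])]


def decode(h):
    """Inverse mapping from the angle back to input space."""
    return [math.cos(h[0]), math.sin(h[0])]


def radius(p):
    """Distance from the origin; equals 1 exactly on the data manifold."""
    return math.hypot(p[0], p[1])


x_a, x_b = [1.0, 0.0], [0.0, 1.0]
raw_path = [lerp(x_a, x_b, k / 4) for k in range(5)]
deep_path = interpolate_in_representation(x_a, x_b, encode, decode)
print([round(radius(p), 3) for p in raw_path])
print([round(radius(p), 3) for p in deep_path])
```

Interpolating in input space cuts through the interior of the circle, away from the manifold, while interpolating between the codes stays on it, which is the behavior H3(a) predicts for a representation in which the manifold is flat.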

4.2 Measuring Mixing by Counting Number of Classes Visited 

We evaluate here the ability of the chains to mix among various classes. We consider
sequences of length 10, 20 or 100 and compute histograms of the number of different classes
visited in a sequence, for the two different depths and learners, on TFD. Since classes typically
are in different modes (manifolds), counting how many different classes are visited in
an MCMC run tells us how quickly the chain mixes. Results in Fig. 3(c,f) show that
the deeper architectures visit more classes and the CAE mixes faster than the DBN.
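
The counting procedure can be sketched as below. The sketch assumes that each generated sample has already been assigned a class label (how labels are obtained for generated samples is not specified here, so a classifier is one hypothetical option); the function names and the toy label sequences are illustrative.

```python
from collections import Counter


def classes_visited(labels, length):
    """Number of distinct classes in each consecutive, non-overlapping window of a chain."""
    return [
        len(set(labels[start:start + length]))
        for start in range(0, len(labels) - length + 1, length)
    ]


def histogram(counts):
    """Map each number of visited classes to how many sequences achieved it."""
    return dict(sorted(Counter(counts).items()))


# Toy label sequences standing in for the classes of consecutive MCMC samples.
slow_chain = [0] * 10 + [1] * 10  # stays in one mode for a long time
fast_chain = [0, 1, 2, 3, 4] * 4  # changes class at every step

print(histogram(classes_visited(slow_chain, 10)))
print(histogram(classes_visited(fast_chain, 10)))
```

A histogram concentrated on larger counts indicates a chain that moves between class modes more often within a sequence of the given length.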

4.3 Occupying More Volume Around Data Points 

In these experiments (Fig. 3(a,b,d,e)) we estimate the quality of samples whose
representation is in the neighborhood of training examples, at different levels of
representation. In the first setting (Fig. 3(a,b)), the samples are interpolated at the midpoint
between an example and its $k$-th nearest neighbor, with $k$ on the x-axis. In the second
case (Fig. 3(d,e)), isotropic noise is added around an example, with noise standard
deviation on the x-axis. In both cases, 500 samples are generated for each data point
plotted on the figures, with the y-axis being the log-likelihood introduced earlier, i.e., 
estimating the quality of the samples. We find that on higher-level representations of 
both the CAE and DBN, a much larger proportion of the local volume is occupied 
by likely configurations, i.e., closer to the input-space manifold near which the actual
data-generating distribution concentrates. Whereas the first experiment shows that this 
is true in the convex set between neighbors at different distances (i.e., in the directions 
of the manifold), the second shows that this is also true in random directions (but of 
course likelihoods are also worse there). The first result therefore agrees with H3(a) 
(unfolding) and H3(b) (volume expansion), while the second result mostly confirms 
H3(b).

(a) Interpolating between an example and its 200-th nearest neighbor
(b) Interpolating between an example and its nearest neighbor
(c) Sequences of points interpolated at different depths

Figure 2: Linear interpolation between a data sample and its 200-th (a) and 1st (b)
nearest neighbor, at various depths (top row = input space, middle row = 1st layer, bottom
row = 2nd layer). In each 3x3 block the left and right columns are test examples while
the middle column is the interpolated point's input image. Interpolating at higher
levels clearly gives more likely samples. Especially in the raw input space (a, 2nd block),
one can see two mouths overlapping while only one mouth appears for the interpolated
point at the 2nd layer. Interpolating with the 1-nearest neighbor does not show any
difference between the levels. In (c), we interpolate between samples of different classes,
at different depths (top = raw input, middle = 1st layer, bottom = 2nd layer). Note how in
lower levels one has to go through implausible patterns, whereas in the deeper layers
one almost jumps from a high-density region (of one class) to another.

Figure 3: (a) (b) Local Convex Hull - Log-likelihoods computed w.r.t. linearly
interpolated samples between an example and its k-NNs, for k between 1 and 500. The
manifold is unfolded in deeper levels. (d) (e) Local Convex Ball - Log-likelihoods of
samples generated by adding Gaussian noise to the representation at different levels
($\sigma \in [0.01, 5]$). More volume is taken by good samples on deeper layers. (c) (f)
Mixing Histograms - number of classes visited (x-axis) over 10 samples (c), 20 samples (f)
with CAE, 100 samples (f) with DBN.

4.4 Discriminative Ability vs Volume Expansion

The proposed hypotheses could arguably correspond to worse discriminative power. 
Indeed, if on the higher-level representations the different classes are "closer" to each 
other (making it easier to mix between them), would it not mean that they are more 
confusable? We first confirm with the tested models (as a sanity check) that the deeper 
level features are conducive to better classification performance, in spite of their better 
generative abilities and better mixing. 

We train a linear SVM on the concatenation of the raw input with the upper-layer
representations. Results presented in Table 1 show that the representation is more
linearly separable if one increases the depth of the architecture and that the information
added by each layer is helpful for classification. Also, fine-tuning an MLP initialized
with those weights is still the best way to reach state-of-the-art performance.





          Classification                                      Log-likelihood
          MNIST              TFD
          SVM      MLP+      SVM             MLP+             MNIST             TFD
raw       8.34%              33.48 ±2.14%
CAE-1     1.97%    1.14%     25.44 ±2.45%    24.12 ±1.87%     67.69 ±2.87       591.90 ±12.27
CAE-2     1.73%    0.81%     24.76 ±2.46%    23.73 ±1.62%     121.17 ±1.59      2110.09 ±49.15
DBN-1     1.62%    1.21%     26.85 ±1.62%    28.14 ±1.40      -243.91 ±54.11    604 ±14.67
DBN-2     1.33%    0.99%     26.54 ±1.91%    27.79 ±2.34      137.89 ±2.11      1908.80 ±65.94

Table 1: Left: Classification error rates of various classifiers using representations learned
on the MNIST and TFD datasets. The DBN 0.99% error on MNIST has been obtained
with a 3-layer DBN and the 0.81% error with the Manifold Tangent Classifier (Rifai
et al., 2011b) that is based on a CAE-2 and discriminant fine-tuning. MLP+ uses
discriminant fine-tuning. Right: Log-likelihoods from Parzen-window density
estimators based on 10,000 samples generated by each model. This quantitatively confirms
that the samples generated from deeper levels are of higher quality, in the sense of
better covering the zones where test examples are found.

To explain the good discriminant abilities of the deeper layers (either when con- 
catenated with lower layers or when fine-tuned discriminatively) in spite of the better 
mixing observed, we conjecture the help of a better disentangling of the underlying fac- 
tors of variation, and in particular of the class factors. This would mean that the man- 
ifolds associated with different classes are more unfolded (as assumed by H3(a)) and 
possibly that different hidden units specialize more to specific classes than they would 
on lower layers. Hence the unfolding (H3(a)) and disentangling (H2) hypotheses
reconcile better discriminative ability with expanded volume of good samples (H3(b)).

5 Conclusion 

The following hypotheses were tested: (1) deeper representations can yield better
samples and better mixing; (2) this is due to better disentangling; (3) this is associated with
unfolding of the manifold where data concentrate along with expansion of the volume
good samples take in representation-space. The experimental results were in agreement 
with these hypotheses. They showed better samples and better mixing on higher levels, 
better samples obtained when interpolating between examples at higher levels, and bet- 
ter samples obtained when adding isotropic noise at higher levels. We also considered 
the potential conflict between the third hypothesis and better discrimination (confirmed 
on the models tested) and explained it away as a consequence of the second hypothesis. 

This could be immediate good news for applications that need to generate MCMC
samples: by transporting the problem to deeper representations, better and faster results
could be obtained. Future work should also investigate the link between better mixing 
and the process of training deep learners themselves, when they depend on an MCMC 
to estimate the log-likelihood gradient. One interesting direction is to investigate the 
relation between parallel tempering and the better mixing chains obtained from deeper 
layers.